98 research outputs found

    Roller: A novel approach to web information extraction

    Research on web information extraction focuses on learning rules to extract selected information from web documents. Many proposals are ad hoc and cannot benefit from advances in machine learning; furthermore, they are likely to fade away as the Web evolves and their intrinsic assumptions are no longer satisfied. Some authors have explored transforming web documents into relational data and then using techniques inspired by inductive logic programming. In theory, such proposals should be easier to adapt as the Web evolves because they build on catalogues of features that can be adapted without changing the proposals themselves. Unfortunately, they are difficult to scale as the number of documents or features increases. In the general field of machine learning, there are propositio-relational proposals that attempt to provide effective and efficient means to learn from relational data using propositional techniques, but they have seldom been explored in the context of web information extraction. In this article, we present a new proposal called Roller: it relies on a search procedure that uses a dynamic flattening technique to explore the context of the nodes that provide the information to be extracted; it is configured with an open catalogue of features, so that it can adapt to the evolution of the Web; it also requires a base learner and a rule scorer, which helps it benefit from the continuous advances in machine learning. Our experiments confirm that it outperforms other state-of-the-art proposals in terms of effectiveness and that it is very competitive in terms of efficiency; we have also confirmed that our conclusions are solid from a statistical point of view.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-

    On Learning Web Information Extraction Rules with TANGO

    Research on Enterprise Systems Integration focuses on proposals to support business processes by re-using existing systems. Wrappers help re-use web applications that provide a user interface only. They emulate a human user who interacts with them and extract the information of interest in a structured format. In this article, we present TANGO, which is our proposal to learn rules to extract information from semi-structured web documents with high precision and recall, which is a must in the context of Enterprise Systems Integration. It relies on an open catalogue of features that helps map the input documents into a knowledge base in which every DOM node is represented by means of HTML, DOM, CSS, relational, and user-defined features. Then a procedure with many variation points is used to learn extraction rules from that knowledge base; the variation points include heuristics that range from how to select a condition to how to simplify the resulting rules. We also provide a systematic method to help re-configure our proposal. Our exhaustive experimentation shows that it outperforms other proposals regarding effectiveness and is efficient enough for practical purposes. Our proposal was devised to be as configurable as possible, which helps adapt it to particular web sites and evolve it when necessary.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
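The idea of mapping every DOM node to a feature vector through an open catalogue, as the TANGO abstract above describes, can be illustrated with a minimal sketch. This is not TANGO's actual catalogue or knowledge-base format: the feature names (`tag`, `css_class`, `text_length`, `parent_tag`, `is_numeric`), the `build_knowledge_base` helper, and the sample markup are all assumptions made for illustration, and the input is assumed to be well-formed XHTML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical open catalogue: feature name -> function of (node, parent).
# New features can be added without touching the code that uses the catalogue.
CATALOGUE = {
    "tag": lambda node, parent: node.tag,
    "css_class": lambda node, parent: node.get("class", ""),
    "text_length": lambda node, parent: len((node.text or "").strip()),
    "parent_tag": lambda node, parent: parent.tag if parent is not None else None,
}


def build_knowledge_base(markup, catalogue=CATALOGUE):
    """Return one feature dictionary per node, in document order."""
    root = ET.fromstring(markup)
    # ElementTree nodes do not store their parent, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    return [
        {name: feature(node, parents.get(node)) for name, feature in catalogue.items()}
        for node in root.iter()
    ]


def is_numeric(node, parent):
    """User-defined feature (assumed): does the node text look like a number?"""
    return (node.text or "").strip().replace(".", "", 1).isdigit()


markup = (
    '<div class="book">'
    '<span class="title">Dune</span>'
    '<span class="price">9.99</span>'
    "</div>"
)
kb = build_knowledge_base(markup, dict(CATALOGUE, is_numeric=is_numeric))
for row in kb:
    print(row)
```

A rule learner would then search over conditions on these features; because the catalogue is just a mapping, adapting to a new site amounts to adding or replacing entries rather than changing the learner.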

    On Extracting Information from Semi-structured Deep Web Documents

    Some software agents need information that is provided by web sites, which is difficult to obtain if the sites lack a query API. Information extractors are intended to extract the information of interest automatically and offer it in a structured format. Unfortunately, most of them rely on ad hoc techniques, which makes them fade away as the Web evolves. In this paper, we present a proposal that relies on an open catalogue of features, which allows it to be adapted easily; we have also devised an optimisation that allows it to be very efficient. Our experimental results show that our proposal outperforms other state-of-the-art proposals.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-

    On validating web information extraction proposals

    Many people who have to make informed decisions in today’s always-on culture use information extractors to feed their systems with information that comes from human-friendly documents. Unfortunately, many proposals that validate information extractors have deficiencies that make it difficult to perform homogeneous comparisons, confirm or refute performance hypotheses, or draw unbiased conclusions. Consequently, it is very difficult to select the best-performing proposal on a sound basis. The state-of-the-art validation method overcomes many deficiencies in the previous proposals, but still overlooks the following issues: completeness of the validation datasets, that is, whether they provide a complete set of annotations or not; structure of the information, that is, whether they check the structure of the record instances extracted or just the attribute instances; and, finally, how extractions and annotations are matched. The decisions made regarding the previous issues have an impact on the effectiveness results. In this article, we have exhaustively analysed the literature and we have also highlighted the main weaknesses to tackle. We present a guideline and a method to compute the effectiveness, which complements and enhances the state-of-the-art validation method.
    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-1060; Junta de Andalucía US-138137
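To make concrete why the way extractions and annotations are matched affects the effectiveness results, here is a minimal sketch. It is not the method proposed in the article: the `effectiveness` function, the greedy one-to-one matching, the two matching policies (`exact` and `lenient`), and the sample data are assumptions chosen only to show that the same extractor output can score differently under different policies.

```python
def effectiveness(extractions, annotations, match):
    """Precision, recall and F1 under a pluggable matching policy.

    Each annotation can be matched by at most one extraction (greedy matching).
    """
    used = set()
    true_positives = 0
    for extraction in extractions:
        for index, annotation in enumerate(annotations):
            if index not in used and match(extraction, annotation):
                used.add(index)
                true_positives += 1
                break
    precision = true_positives / len(extractions) if extractions else 0.0
    recall = true_positives / len(annotations) if annotations else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def exact(extraction, annotation):
    """Attribute name and text must be identical."""
    return extraction == annotation


def lenient(extraction, annotation):
    """Same attribute, and one text contains the other."""
    same_attribute = extraction[0] == annotation[0]
    return same_attribute and (annotation[1] in extraction[1] or extraction[1] in annotation[1])


annotations = [("title", "Dune"), ("price", "9.99"), ("author", "Herbert")]
extractions = [("title", "Dune"), ("price", "9.99 EUR")]

print(effectiveness(extractions, annotations, exact))    # precision 0.5, recall 1/3, F1 0.4
print(effectiveness(extractions, annotations, lenient))  # precision 1.0, recall 2/3, F1 0.8
```

The extractor output is identical in both calls; only the matching policy changes, and F1 doubles. An incomplete annotation set would distort the figures in a similar way, since correct extractions with no annotation would count as false positives.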

    Enterprise Information Integration: New Approaches to Web Information Extraction

    The way we understand information has changed radically in recent decades thanks to the Web, which drives people to use the Internet at an ever faster pace. It is no surprise, then, that it has become one of the most widely used and universally accessible data distribution channels. However, data alone do not have enough value; they must be turned into information from which useful knowledge can be inferred. This is the purpose of business intelligence, which involves a process of integrating and transforming data into information and then obtaining knowledge in order to make effective decisions. For this integration and transformation process to take place, information extractors are needed: these are the tools that extract data from the Web and endow them with structure and semantics so that they can be interpreted by people or incorporated into automated business processes in order to exploit them intelligently. In this thesis we focus on learning rules to extract information from semi-structured web documents and on how to evaluate different proposals in order to obtain a ranking in a fully automatic way. Our two information extraction proposals are TANGO and ROLLER; both are based on an open catalogue of features and on inductive techniques. Our proposal for obtaining rankings is called VENICE; it provides an automatic, open, and agnostic method that is based on statistical techniques. We hope that our contributions in this thesis will be useful to both researchers and practitioners and that they will help reduce costs in projects that require extracting information from the Web.

    On improving FOIL Algorithm

    FOIL is an Inductive Logic Programming algorithm to discover first-order rules that explain the patterns involved in a domain of knowledge. Domains such as Information Retrieval or Information Extraction are a handicap for FOIL because of the huge amount of information it needs to manage to devise the rules. Current solutions to problems in these domains are restricted to devising ad hoc, domain-dependent inductive algorithms that use a less expressive formalism to code rules. We work on optimising the FOIL learning process to deal with such complex domain problems while retaining expressiveness. Our hypothesis is that changing the information gain scoring function, used by FOIL to decide how rules are learnt, can reduce the number of steps the algorithm performs. We have analysed 15 scoring functions, normalised them into a common notation and run a test in which they are computed. The learning process will be evaluated according to its efficiency, and the quality of the rules according to their precision, recall, complexity and specificity. The results reinforce our hypothesis, demonstrating that replacing the information gain can optimise both the FOIL algorithm execution and the learnt rules.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-
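The role of the scoring function can be sketched as follows. `foil_gain` follows the standard FOIL information gain, simplified by assuming that the candidate condition introduces no new variables, so the number of positive bindings that remain covered equals `p1`. The `laplace` scorer is only one well-known alternative picked for illustration; the abstract does not say which 15 functions were analysed. The `best_condition` helper, the condition names and the coverage counts are hypothetical.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """Standard FOIL gain for adding one condition to a rule.

    p0, n0: positive/negative examples covered before adding the condition.
    p1, n1: positive/negative examples covered after adding it.
    """
    if p0 == 0 or p1 == 0:
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)


def laplace(p0, n0, p1, n1):
    """Alternative scorer (assumed for illustration): Laplace accuracy estimate."""
    return (p1 + 1) / (p1 + n1 + 2)


def best_condition(candidates, p0, n0, score=foil_gain):
    """Pick the candidate condition with the highest score."""
    return max(candidates, key=lambda name: score(p0, n0, *candidates[name]))


# Hypothetical candidates: condition name -> (p1, n1) coverage after adding it.
candidates = {"has_price_class": (8, 2), "is_bold": (3, 0)}

print(best_condition(candidates, 10, 10))                 # has_price_class
print(best_condition(candidates, 10, 10, score=laplace))  # is_bold
```

With the same candidates, the two scorers choose different conditions: FOIL gain favours the condition that keeps more positives, whereas the Laplace estimate favours the purer one. Choices like this determine how many refinement steps the search needs.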

    Optimising FOIL by new scoring functions

    FOIL is an Inductive Logic Programming algorithm to discover first-order rules that explain the patterns involved in a domain of knowledge. Domains with a huge amount of information are a handicap for FOIL because of the explosion of the search space that must be explored to devise the rules. Current solutions to problems in these domains are restricted to devising ad hoc, domain-dependent inductive algorithms that use a less expressive formalism to code rules. We work on optimising the FOIL learning process to deal with such complex domain problems while retaining expressiveness. Our hypothesis is that changing the Information Gain scoring function, used by FOIL to decide how rules are learnt, can reduce the number of steps the algorithm performs. We have analysed 15 scoring functions, normalised them into a common notation and run a test in which they are computed. The learning process will be evaluated according to its efficiency, and the quality of the rules according to their precision, recall, complexity and specificity. The results reinforce our hypothesis, demonstrating that replacing the Information Gain can optimise both the FOIL algorithm execution and the learnt rules.


    Social studies of the world of work consistently show that, for a decent society to exist, basic conditions for its development must be promoted that afford all its members minimum guarantees of being able to lead a dignified life.

    ARIEX: Automated ranking of information extractors

    Information extractors are used to transform the user-friendly information in a web document into structured information that can be used to feed a knowledge-based system. Researchers are interested in ranking them to find out which one performs the best. Unfortunately, many rankings in the literature are deficient. There are a number of formal methods to rank information extractors, but they also have many problems and have not reached widespread popularity. In this article, we present ARIEX, which is an automated method to rank web information extraction proposals. It does not have any of the problems that we have identified in the literature. Our proposal shall definitely help authors make sure that they have advanced the state of the art not only conceptually, but from an empirical point of view; it shall also help practitioners make informed decisions on which proposal is the most adequate for a particular problem.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-

    On Member Labelling in Social Networks

    Software agents are increasingly used to search for experts, recommend resources, assess opinions, and perform other similar tasks in the context of social networks, which requires accurate information that describes the features of the members of the network. Unfortunately, many member profiles are incomplete, which has motivated many authors to work on automatic member labelling, that is, on techniques that can infer the null features of a member from his or her neighbourhood. Current proposals are based on local or global approaches; the former compute predictors from local neighbourhoods, whereas the latter analyse social networks as a whole. Their main problem is that they tend to be inefficient and their effectiveness degrades significantly as the percentage of null labels increases. In this paper, we present Katz, which is a novel hybrid proposal to solve the member labelling problem using neural networks. Our experiments show that it outperforms other proposals in the literature in terms of both effectiveness and efficiency.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
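For readers unfamiliar with the member labelling problem, the simplest kind of local approach mentioned above can be sketched as a majority vote over a member's labelled neighbours. This is a baseline for illustration only and is not Katz, which the abstract describes as a hybrid proposal based on neural networks. The `label_by_neighbours` function, the adjacency-list graph format, and the sample network are assumptions.

```python
from collections import Counter


def label_by_neighbours(graph, labels):
    """Infer null labels by majority vote among directly connected members.

    graph:  member -> list of neighbouring members.
    labels: member -> label, or None when the label is missing.
    """
    inferred = dict(labels)
    for member, neighbours in graph.items():
        if labels.get(member) is not None:
            continue  # label already known
        votes = Counter(
            labels[n] for n in neighbours if labels.get(n) is not None
        )
        if votes:
            inferred[member] = votes.most_common(1)[0][0]
    return inferred


# Hypothetical network: ana and zoe have null labels.
graph = {
    "ana": ["bob", "eve", "dan"],
    "bob": ["ana"],
    "eve": ["ana"],
    "dan": ["ana"],
    "zoe": [],
}
labels = {"ana": None, "bob": "ml", "eve": "ml", "dan": "db", "zoe": None}

print(label_by_neighbours(graph, labels))
```

Here `ana` receives the label `ml` from two of her three neighbours, while `zoe` stays unlabelled because she has no labelled neighbours. That second case shows why purely local predictors degrade as the share of null labels grows.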